In single table queries, it is usually unambiguous to the query engine which column and which table you need to query.
However, when you involve multiple tables, it is important to know how to refer to a column in a specific table.
For example:
| person_id | year_of_birth |
|---|---|
| 6 | 1963 |
| 123 | 1950 |
| 129 | 1974 |
| 16 | 1971 |
| 65 | 1967 |
| 74 | 1972 |
| 42 | 1909 |
| 187 | 1945 |
| 18 | 1965 |
| 111 | 1975 |
Your turn to use table references:
| procedure_occurrence_id | person_id | procedure_concept_id | procedure_date | procedure_datetime | procedure_type_concept_id | modifier_concept_id | quantity | provider_id | visit_occurrence_id | visit_detail_id | procedure_source_value | procedure_source_concept_id | modifier_source_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 44783196 | 1981-08-17 | 1981-08-17 | 38000275 | 0 | NA | NA | 85 | 0 | 699253003 | 44783196 | NA |
| 2 | 1 | 4125906 | 1982-09-11 | 1982-09-11 | 38000275 | 0 | NA | NA | 83 | 0 | 288086009 | 4125906 | NA |
| 3 | 1 | 4252419 | 1981-08-10 | 1981-08-10 | 38000275 | 0 | NA | NA | 82 | 0 | 74016001 | 4252419 | NA |
| 4 | 1 | 4170947 | 1958-03-11 | 1958-03-11 | 38000275 | 0 | NA | NA | 79 | 0 | 274474001 | 4170947 | NA |
| 5 | 1 | 4047491 | 1958-03-11 | 1958-03-11 | 38000275 | 0 | NA | NA | 79 | 0 | 1225002 | 4047491 | NA |
For each person_id in the person table, there may be duplicated person_ids in procedure_occurrence table, as a patient can have multiple procedures. This is a one-to-many relationship.
Multiple elements of procedure_concept_id in the procedure_occurrence table may correspond to a single element of concept_id in the “concept” table. This is a many-to-one relationship.
You can also have a one-to-one relationship.
To set the stage, let’s show two tables, x and y. We want to join them by the keys, which are represented by colored boxes in both of the tables.
In an INNER JOIN, we only retain rows that have elements that exist in both the x and y tables.
INNER JOIN syntax| person_id | procedure_occurrence_id |
|---|---|
| 343 | 3554 |
| 357 | 3741 |
| 399 | 3928 |
| 406 | 4115 |
| 411 | 4302 |
| 430 | 4489 |
| 442 | 4676 |
| 453 | 4863 |
| 469 | 5050 |
| 488 | 5237 |
FROM person and INNER JOIN procedure_occurrence specifies the tables to be joined.
ON person.person_id = procedure_occurrence.person_id specifies the columns from each table for keys.
We can short-hand the table names via the AT statement:
| person_id | procedure_occurrence_id |
|---|---|
| 343 | 3554 |
| 357 | 3741 |
| 399 | 3928 |
| 406 | 4115 |
| 411 | 4302 |
| 430 | 4489 |
| 442 | 4676 |
| 453 | 4863 |
| 469 | 5050 |
| 488 | 5237 |
LEFT JOINIf a row exists in the left table, but not the right table, it will be replicated in the joined table, but have rows with NULL columns from the right table.
We can see the difference between a INNER JOIN and LEFT JOIN by counting the number of rows kept after joining:
| count_star() |
|---|
| 37409 |
JOINsRIGHT JOIN is identical to LEFT JOIN, except that the rows preserved are from the right table.FULL JOIN retains all rows in both tables, regardless if there is a key match.ANTI JOIN is helpful to find all of the keys that are in the left table, but not the right tableJOINsCan we do a triple join?
Suppose that we want a table with person.person_id, procedure_occurrence.procedure_occurrence_id, and concept.concept_name.
Some suggested steps:
INNER JOIN person and procedure_occurrence, to produce an output tableINNER JOIN it with concept.JOIN with WHERELet’s add an additional WHERE where we only want those rows that have the concept_name of ’Subcutaneous immunotherapy`:
| person_id | procedure_occurrence_id | concept_name |
|---|---|---|
| 16 | 289 | Subcutaneous immunotherapy |
| 180 | 1958 | Subcutaneous immunotherapy |
| 9 | 187 | Subcutaneous immunotherapy |
| 5 | 119 | Subcutaneous immunotherapy |
| 36 | 559 | Subcutaneous immunotherapy |
| 124 | 1226 | Subcutaneous immunotherapy |
| 225 | 2244 | Subcutaneous immunotherapy |
| 409 | 4243 | Subcutaneous immunotherapy |
| 236 | 2392 | Subcutaneous immunotherapy |
| 260 | 2556 | Subcutaneous immunotherapy |
WHERE: AND versus ORRevisiting WHERE, we can combine conditions with AND or OR.
AND is always going to be more restrictive than OR, because our rows must meet two conditions.
| count_star() |
|---|
| 1261 |
On the other hand OR is more permissive than AND, because our rows must meet only one of the conditions.
| count_star() |
|---|
| 2629 |
ORDER BYORDER BY lets us sort tables by one or more columns:
| person_id | procedure_occurrence_id | procedure_date |
|---|---|---|
| 1 | 1 | 1981-08-17 |
| 1 | 2 | 1982-09-11 |
| 1 | 3 | 1981-08-10 |
| 1 | 4 | 1958-03-11 |
| 1 | 5 | 1958-03-11 |
| 2 | 6 | 1955-10-22 |
| 2 | 7 | 1977-04-08 |
| 2 | 8 | 1931-09-03 |
| 2 | 9 | 2007-09-04 |
| 2 | 10 | 1924-01-12 |
Once we sorted by person_id, we see that for every unique person_id, there can be multiple procedures! This suggests that there is a one-to-many relationship between person and procedure_occurrence tables.
Some constraints we can require on columns of a table:
INTEGER, VARCHARNOT NULL - no values can have a NULL value.UNIQUE - all values must be unique.PRIMARY KEY - NOT NULL and UNIQUE.FOREIGN KEY - value must exist as a primary key in another table’s field. The referenced table’s field must be specified.CHECK - check the data type and conditions. One example would be our data shouldn’t be before 1900.DEFAULT - default values are given if not provided.A PRIMARY KEY is required for any table, and cannot be NULL and must be unique. This gives an unique id for each entry of the table.
When we create tables in our database, we need to specify which column is a PRIMARY KEY:
CREATE TABLE person (
person_id INTEGER PRIMARY KEY
)
FOREIGN KEY involves two or more tables. If a column is declared a FOREIGN KEY, then that key value must exist in a REFERENCES table as a primary key.
CREATE TABLE procedure_occurrence {
procedure_occurrence_id PRIMARY KEY,
person_id INTEGER REFERENCES person(person_id)
procedure_concept_id INTEGER REFERENCES concept(concept_id)
}
When we’re done, it’s best to close the connection with dbDisconnect().